This is an IMDb data set containing 40,000 movie reviews. Each row
contains two columns: the text of the review and a sentiment label. Our
goal is to categorize reviews by sentiment and predict the genre of the
movie each review is about. We will attempt to accomplish this with
tidyverse tools, sentiment analysis, tf-idf, bigram correlations, and
LDA topic modeling. We start by loading the CSV and assigning it to the
variable data.
data <- read.csv("movie.csv")
head(data)
## text
## 1 I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.
## 2 When I put this movie in my DVD player, and sat down with a coke and some chips, I had some expectations. I was hoping that this movie would contain some of the strong-points of the first movie: Awsome animation, good flowing story, excellent voice cast, funny comedy and a kick-ass soundtrack. But, to my disappointment, not any of this is to be found in Atlantis: Milo's Return. Had I read some reviews first, I might not have been so let down. The following paragraph will be directed to those who have seen the first movie, and who enjoyed it primarily for the points mentioned.<br /><br />When the first scene appears, your in for a shock if you just picked Atlantis: Milo's Return from the display-case at your local videoshop (or whatever), and had the expectations I had. The music feels as a bad imitation of the first movie, and the voice cast has been replaced by a not so fitting one. (With the exception of a few characters, like the voice of Sweet). The actual drawings isnt that bad, but the animation in particular is a sad sight. The storyline is also pretty weak, as its more like three episodes of Schooby-Doo than the single adventurous story we got the last time. But dont misunderstand, it's not very good Schooby-Doo episodes. I didnt laugh a single time, although I might have sniggered once or twice.<br /><br />To the audience who haven't seen the first movie, or don't especially care for a similar sequel, here is a fast review of this movie as a stand-alone product: If you liked schooby-doo, you might like this movie. If you didn't, you could still enjoy this movie if you have nothing else to do. And I suspect it might be a good kids movie, but I wouldn't know. It might have been better if Milo's Return had been a three-episode series on a cartoon channel, or on breakfast TV.
## 3 Why do people who do not know what a particular time in the past was like feel the need to try to define that time for others? Replace Woodstock with the Civil War and the Apollo moon-landing with the Titanic sinking and you've got as realistic a flick as this formulaic soap opera populated entirely by low-life trash. Is this what kids who were too young to be allowed to go to Woodstock and who failed grade school composition do? "I'll show those old meanies, I'll put out my own movie and prove that you don't have to know nuttin about your topic to still make money!" Yeah, we already know that. The one thing watching this film did for me was to give me a little insight into underclass thinking. The next time I see a slut in a bar who looks like Diane Lane, I'm running the other way. It's child abuse to let parents that worthless raise kids. It's audience abuse to simply stick Woodstock and the moonlanding into a flick as if that ipso facto means the film portrays 1969.
## 4 Even though I have great interest in Biblical movies, I was bored to death every minute of the movie. Everything is bad. The movie is too long, the acting is most of the time a Joke and the script is horrible. I did not get the point in mixing the story about Abraham and Noah together. So if you value your time and sanity stay away from this horror.
## 5 Im a die hard Dads Army fan and nothing will ever change that. I got all the tapes, DVD's and audiobooks and every time i watch/listen to them its brand new. <br /><br />The film. The film is a re run of certain episodes, Man and the hour, Enemy within the gates, Battle School and numerous others with a different edge. Introduction of a new General instead of Captain Square was a brilliant move - especially when he wouldn't cash the cheque (something that is rarely done now).<br /><br />It follows through the early years of getting equipment and uniforms, starting up and training. All in all, its a great film for a boring Sunday afternoon. <br /><br />Two draw backs. One is the Germans bogus dodgy accents (come one, Germans cant pronounced the letter "W" like us) and Two The casting of Liz Frazer instead of the familiar Janet Davis. I like Liz in other films like the carry ons but she doesn't carry it correctly in this and Janet Davis would have been the better choice.
## 6 A terrible movie as everyone has said. What made me laugh was the cameo appearance by Scott McNealy, giving an award to one of the murdered programmers in front of a wall of SUN logos. McNealy is the CEO of SUN Microsystem, a company that practically defines itself by its hatred of Microsoft. They have been instrumental in filing antitrust complaints against Microsoft. So, were they silly enough to think this bad movie would add fuel to that fire?<br /><br />There's no public record I see of SUN's involvement, but clearly the makers of this movie know Scott McNealy. An interesting mystery.
## label
## 1 0
## 2 0
## 3 0
## 4 0
## 5 1
## 6 0
First, we tokenize the text into individual tokens, where each token is a single word.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
tokens <- data %>% unnest_tokens(output = word, input = text)
head(tokens)
## label word
## 1 0 i
## 2 0 grew
## 3 0 up
## 4 0 b
## 5 0 1965
## 6 0 watching
Next, we remove stop words.
library(ggplot2)
data(stop_words) #provided from tidytext
# remove stop words
tidy_data <- tokens %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
# find most common words used in reviews
tidy_data %>% count(word, sort = TRUE) %>%
head(20) %>%
# reorder word levels by count for plotting
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) + geom_col() + labs(y = NULL)
Skimming the top 20 words used in these reviews, we see “br” is the
most common. Upon closer inspection of the data set, “br” comes from
the HTML tag <br /> used to indicate line breaks. Since this is not a
valid word, we add it as a custom stop word to remove it from the data
set.
# add br as stop word
custom_stop_words <- bind_rows(tibble(word = c("br"),
lexicon = c("custom")),
stop_words)
tidy_data <- tokens %>% anti_join(custom_stop_words)
## Joining with `by = join_by(word)`
tidy_data %>% count(word, sort = TRUE) %>%
head(20) %>%
# reorder word levels by count for plotting
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) + geom_col() + labs(y = NULL)
Knowing this is a data set of movie reviews, it makes sense for the words “movie” and “film” to be the most common words used.
Using the NRC lexicon, we label each word with its assigned sentiment(s) and plot the distribution from most to least common. Note that this visualization is limited because a word may carry multiple sentiments. Evaluating words individually, we see the most common sentiment is positive, with negative as the runner-up.
sent_data <- tidy_data %>% inner_join( get_sentiments("nrc"),
by='word')
## Warning in inner_join(., get_sentiments("nrc"), by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 4 of `x` matches multiple rows in `y`.
## ℹ Row 10935 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
sent_data %>% count(sentiment, sort = TRUE) %>%
mutate(sentiment = reorder(sentiment, n)) %>%
ggplot(aes(n, sentiment)) + geom_col() + labs(y = NULL)
Let’s take a closer look at which positive and negative words (the top two sentiments) are most common in the movie reviews. A word cloud can better depict this.
library(reshape2)
library(wordcloud)
## Loading required package: RColorBrewer
tidy_data %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
# acast() turns the data frame into a matrix
# The cast formula has the following format: x_variable + x_2 ~ y_variable + y_2 ~ z_variable ~ ... The order of the variables makes a difference. The first varies slowest, and the last fastest.
# value.var is the name of column which stores values
acast(formula = word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red3", "skyblue"), max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 321421 of `x` matches multiple rows in `y`.
## ℹ Row 414 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
To analyze more closely what a document is about, we look at word frequencies. We first verify that Zipf’s law holds, which states that the frequency of a word is inversely proportional to its frequency rank.
len <- dim(tidy_data)[1]
freq_by_rank <- tidy_data %>%
count(word, sort = TRUE) %>%
mutate(rank = row_number(),
`term frequency` = n/len)
head(freq_by_rank)
## word n rank term frequency
## 1 movie 69718 1 0.019800040
## 2 film 61989 2 0.017604990
## 3 time 20107 3 0.005710425
## 4 story 18380 4 0.005219954
## 5 bad 14734 5 0.004184483
## 6 people 14294 6 0.004059522
freq_by_rank %>%
ggplot(aes(rank, `term frequency`)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() + scale_y_log10()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
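To go beyond eyeballing the line, we can estimate the exponent of the rank-frequency relationship by fitting a linear model on the log-log scale. This is only a sketch, reusing the freq_by_rank table above; the middle of the rank range is used since both ends of such plots typically deviate from the power law.

```r
# Fit log10(term frequency) ~ log10(rank) over the middle ranks;
# a slope close to -1 is consistent with Zipf's law.
rank_subset <- freq_by_rank %>%
  filter(rank > 10, rank < 1000)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
```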
The log-log plot shows the movie reviews do roughly follow a linear rank-frequency relationship, consistent with Zipf’s law. Next, treating every review as a separate document, we can compute tf-idf.
data$doc <- seq_len(nrow(data)) # give each review a document id
review_words <- data %>% unnest_tokens(output = word, input = text) %>%
count(doc, word, sort = TRUE)
total_words <- review_words %>%
group_by(doc) %>%
summarize(total = sum(n))
review_words <- left_join(review_words, total_words)
## Joining with `by = join_by(doc)`
head(review_words)
## doc word n total
## 1 5879 the 198 1855
## 2 6379 the 198 2493
## 3 34168 the 180 2331
## 4 26988 br 160 1004
## 5 20021 the 128 1732
## 6 11097 br 118 756
tf_idf <- review_words %>%
bind_tf_idf(term = word, document = doc, n=n)
ordered_tf_idf <- tf_idf %>%
select(-total) %>%
arrange(desc(tf_idf))
head(ordered_tf_idf)
## doc word n tf idf tf_idf
## 1 18348 trivialboring 26 0.23008850 10.596635 2.438164
## 2 36010 primary 2 0.33333333 5.592688 1.864229
## 3 9153 smallville 3 0.20000000 7.888585 1.577717
## 4 14437 smallville 3 0.20000000 7.888585 1.577717
## 5 15400 stop.oz 23 0.11979167 10.596635 1.269389
## 6 7241 cognac 12 0.09302326 9.498022 0.883537
The most important word appears to be “trivialboring”; looking at the data set, it was copied and pasted 26 times in review number 18348. The term “stop.oz” also appears important, but on closer examination the review containing it repeats the sentence “OZ is the greatest show ever mad full stop.OZ is the greatest show ever mad full stop.” 23 times. “Cognac” likewise comes from a review that repeats the same sentence 12 times. Because the top results are artifacts of repeated text rather than genuinely distinctive words, tf-idf alone is probably not a reliable method for finding important words in this data set.
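Since the top tf-idf scores are driven by copy-pasted text, one pragmatic mitigation is to drop terms whose within-document term frequency is implausibly high before ranking. This is only a sketch reusing ordered_tf_idf from above; the 0.08 cutoff is an arbitrary illustration, not a tuned value.

```r
# Filter out terms that make up an outsized share of a single review,
# which in this data set signals repeated text rather than importance.
ordered_tf_idf %>%
  filter(tf < 0.08) %>%
  head()
```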
We can also tokenize by n-grams, which are consecutive sequences of n words. Something to note is that tokenizing in this way will result in overlapping tokens.
movie_bigrams <- data %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
head(movie_bigrams)
## label doc bigram
## 1 0 1 i grew
## 2 0 1 grew up
## 3 0 1 up b
## 4 0 1 b 1965
## 5 0 1 1965 watching
## 6 0 1 watching and
To find the most common meaningful bigrams, we separate each bigram into its two words and keep only those bigrams in which neither word is a stop word.
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
bigrams_separated <- movie_bigrams %>%
# separate the bigram column in word1 and word2 columns separating by a space
separate(col = bigram, into = c("word1", "word2"), sep = " ")
# remove stop words
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% custom_stop_words$word) %>%
filter(!word2 %in% custom_stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
head(bigram_counts)
## word1 word2 n
## 1 special effects 1785
## 2 low budget 1461
## 3 sci fi 1146
## 4 real life 998
## 5 main character 863
## 6 horror movie 766
bigrams_united <- bigrams_filtered %>%
unite(col = bigram, c(word1, word2), sep = " ")
head(bigrams_united)
## label doc bigram
## 1 0 1 1965 watching
## 2 0 1 school watched
## 3 0 1 played thunderbirds
## 4 0 1 alan counting
## 5 0 1 art form
## 6 0 1 movie hoping
Looking into negated words, where a preceding “not” reverses a word’s intended sentiment:
AFINN <- get_sentiments("afinn")
not_words <- bigrams_separated %>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")
“Not funny” appears to be the most common negated positive word in these reviews, and “not bad” the most common negated negative word.
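The same analysis extends beyond “not” to other common negation words. A sketch, reusing bigrams_separated and AFINN from above:

```r
# Count AFINN-scored words preceded by any of several negation terms.
negation_words <- c("not", "no", "never", "without")
negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, value, sort = TRUE)
head(negated_words)
```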
To visualize more pairings of words simultaneously, we can use a network of nodes.
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:tidyr':
##
## crossing
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
bigram_graph <- bigram_counts %>%
head(80) %>%
# Create igraph graph object from data frame
graph_from_data_frame()
set.seed(7)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
# edge_alpha to make links transparent based on how common or rare the bigram is
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
# end_cap option tells arrow to end before touching next node
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# add theme for plotting networks
theme_void()
## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
If we were interested in words that occur within the same documents but are not adjacent to each other, we can take the count and correlations of pairs of words.
library(widyr)
words <- data %>% unnest_tokens(output = word, input = text)
word_pairs <- words %>%
filter(!word %in% custom_stop_words$word) %>%
pairwise_count(item = word, feature = doc, sort = TRUE)
head(word_pairs)
## # A tibble: 6 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 film movie 12306
## 2 movie film 12306
## 3 time movie 8838
## 4 movie time 8838
## 5 time film 8222
## 6 film time 8222
word_cors <- words %>%
filter(!word %in% custom_stop_words$word) %>%
group_by(word) %>%
filter(n() >= 1000) %>%
pairwise_cor(word, doc, sort = TRUE)
head(word_cors)
## # A tibble: 6 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 fi sci 0.983
## 2 sci fi 0.983
## 3 effects special 0.534
## 4 special effects 0.534
## 5 budget low 0.525
## 6 low budget 0.525
We can graph these highly correlated words with the correlation as weights.
set.seed(2016)
word_cors %>%
filter(correlation > .15) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Now, we will use topic modeling to group the reviews by the genre of
the movie each review was for. We first need to convert our tidy data
to the document-term matrix object that the topicmodels package
takes as input.
library(topicmodels)
words_dtm <- words %>%
filter(!word %in% custom_stop_words$word) %>%
count(doc, word, sort = TRUE) %>%
cast_dtm(doc, word, n)
head(words_dtm)
## <<DocumentTermMatrix (documents: 6, terms: 107475)>>
## Non-/sparse entries: 1942/642908
## Sparsity : 100%
## Maximal term length: 66
## Weighting : term frequency (tf)
Next, we use the Latent Dirichlet allocation (LDA) algorithm to estimate the mixture of words associated with each of k topics. We start with an arbitrarily low value of k = 4.
lda <- LDA(words_dtm, k = 4, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
head(topics)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 marty 0.0000447
## 2 2 marty 0.0000197
## 3 3 marty 0.00000967
## 4 4 marty 0.0000791
## 5 1 titanic 0.0000356
## 6 2 titanic 0.0000376
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
It appears that every topic makes heavy use of “movie” and “film,” which isn’t helpful in distinguishing topics. What if we excluded those two words from the clustering?
words_dtm <- words %>%
filter(!word %in% custom_stop_words$word) %>%
filter(!word %in% c("movie", "movies", "film", "films")) %>%
count(doc, word, sort = TRUE) %>%
cast_dtm(doc, word, n)
lda <- LDA(words_dtm, k = 6, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 20) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
We still see redundant terms sharing the same root, such as “acting”
and “actor.” To discover more informative terms from these reviews, we
map each word to its stem with the SnowballC package and run LDA on the
stems.
library(SnowballC)
stem_words <- words %>%
filter(!word %in% custom_stop_words$word) %>%
mutate(stem = wordStem(word))
stem_counts <- stem_words %>%
count(stem, sort = TRUE)
head(stem_counts)
## stem n
## 1 movi 81959
## 2 film 75018
## 3 time 25574
## 4 watch 22269
## 5 charact 22263
## 6 stori 20112
After running LDA with k values from 2 to 10, the results for k = 6 looked the most distinct from each other.
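The k sweep itself is not shown; one way it could be done is to refit the model for each candidate k on the stemmed document-term matrix and compare perplexity (lower is better), alongside a visual inspection of the top terms per topic. This sketch assumes the perplexity() function from topicmodels and is slow on 40,000 documents.

```r
# Build the stemmed DTM once, then refit LDA for each candidate k.
stem_dtm <- stem_words %>%
  filter(!word %in% c("movie", "movies", "film", "films")) %>%
  count(doc, stem, sort = TRUE) %>%
  cast_dtm(doc, stem, n)
k_values <- c(2, 4, 6, 8, 10)
perplexities <- sapply(k_values, function(k) {
  perplexity(LDA(stem_dtm, k = k, control = list(seed = 1234)))
})
data.frame(k = k_values, perplexity = perplexities)
```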
words_dtm <- stem_words %>%
filter(!word %in% c("movie", "movies", "film", "films")) %>%
count(doc, stem, sort = TRUE) %>%
cast_dtm(doc, stem, n)
lda <- LDA(words_dtm, k = 6, control = list(seed = 1234))
topics <- tidy(lda, matrix = "beta")
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 20) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
We can also examine the per-document-per-topic probabilities (gamma), the estimated proportion of words from that document that are generated from that topic. We can then assign a topic to each review.
reviews_gamma <- tidy(lda, matrix = "gamma")
head(reviews_gamma)
## # A tibble: 6 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 12896 1 0.173
## 2 15253 1 0.159
## 3 5879 1 0.152
## 4 27425 1 0.175
## 5 6894 1 0.157
## 6 3146 1 0.244
We can take the original document-word pairs and find which words
in each document were assigned to which topic. The
augment() function from the broom package is a way of
tidying model output; here, the output is a tidy data frame of
review-word counts along with the topic each word was assigned to.
assignments <- augment(lda, data = words_dtm)
head(assignments)
## # A tibble: 6 × 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 12896 marti 36 3
## 2 1147 marti 1 1
## 3 36627 marti 1 3
## 4 2962 marti 1 3
## 5 13264 marti 1 3
## 6 26194 marti 1 3
# count the total number of word occurrences assigned to each topic
topic_counts <- assignments %>%
group_by(.topic) %>%
summarise(sum_count = sum(count),
.groups = 'drop')
topic_counts
## # A tibble: 6 × 2
## .topic sum_count
## <dbl> <dbl>
## 1 1 567320
## 2 2 551541
## 3 3 538637
## 4 4 629460
## 5 5 530300
## 6 6 548846
topic_counts %>%
ggplot(aes(x=.topic, y=sum_count)) +
geom_bar(stat="identity")
There appears to be a roughly equal number of words assigned to each topic. To get a better idea of what genre each topic corresponds to, we can look further into the words of each topic and see what kinds of sentiments are expressed in these reviews. One thing to note: because the Bing lexicon leaves many words uncategorized, there is a significant drop in data entries, but the most commonly used words should still show. Using word clouds to visualize the most prominent words, the hypothesized genres of each topic are:
# assign sentiment to each term in assignments, note row count drops from 2,784,478 to 363,404 as there are many words in the reviews not included in the lexicon.
assign_topics <- assignments %>% inner_join(get_sentiments("bing"), by=c('term'='word'))
gen_cloud <- function(topic) {
topic_sub <- subset( assign_topics , .topic == topic )
topic_sub %>% group_by(term) %>%
summarise(count = sum(count)) %>%
inner_join(get_sentiments("bing"), by=c('term'='word')) %>%
acast(formula = term ~ sentiment, value.var = "count", fill = 0) %>%
comparison.cloud(colors = c("red3", "skyblue"), max.words = 100)
}
Topic 1: dystopian/science fiction
gen_cloud(1)
Topic 2: drama
gen_cloud(2)
Topic 3: romantic-comedy
gen_cloud(3)
Topic 4: horror
gen_cloud(4)
Topic 5: war/historical fiction
gen_cloud(5)
Topic 6: comedy/action
gen_cloud(6)
Another limitation of this analysis is that we have no way to check whether our assigned genres match the actual genre of the movie each review is about. The original data set includes a binary “label” column, but it is not informative for genre, since there are realistically more than two movie genres. Because this is an unsupervised learning analysis, we can only make inferences from the results without checking their accuracy. Overall, these results suggest this methodology can be used, to an extent, to predict genre.
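Although genre cannot be validated, the binary label column does allow a rough sanity check of the sentiment side of our goal. The sketch below scores each review by its net Bing sentiment and compares the result with the label; it assumes the label uses 1 for positive and 0 for negative reviews, and reuses the words table from the correlation section.

```r
# Net Bing sentiment per review: predict positive when positive-word
# count exceeds negative-word count, then compare with the label.
review_sentiment <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc, label, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(predicted = as.integer(positive > negative))
mean(review_sentiment$predicted == review_sentiment$label)
```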